Working with text

Sample from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

The data set is the '20 newsgroups dataset', a collection commonly used for benchmarking text classification, described on the 20 newsgroups dataset website.

We will use this data to demonstrate scikit-learn.

To make the examples run more quickly we will limit the data set to just 4 categories.


In [1]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']

Load the training set of data


In [2]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)

In [3]:
twenty_train.target_names


Out[3]:
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Note that the target names are not in the same order as in the categories list.
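
They appear to come back sorted alphabetically rather than in the order given in categories; a quick check (this comparison is illustrative and not part of the original tutorial):


In [ ]:
# The returned target_names appear to be sorted alphabetically,
# regardless of the order used in the categories list
sorted(categories) == twenty_train.target_names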

Count of documents


In [4]:
len(twenty_train.data)


Out[4]:
2257

Show the first 8 lines of text from the first document, formatted with line breaks


In [5]:
print("\n".join(twenty_train.data[0].split("\n")[:8]))


From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to

Path to file on your machine


In [6]:
twenty_train.filenames[0]


Out[6]:
'C:\\Users\\Peter\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38440'

Show the target categories of the first 10 documents, first as a list of indices and then as their names.


In [7]:
print(twenty_train.target[:10])
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])


[1 1 3 3 3 3 3 2 2 2]
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med

Let's look at a full document in the training data.


In [8]:
print("\n".join(twenty_train.data[0].split("\n")))


From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.

Extracting features from text files

For machine learning to be used, the text must be turned into numerical feature vectors.

What is a feature vector?

  • Each word in the corpus is assigned an integer identifier
  • Each document is assigned an integer identifier
  • For each document, the number of occurrences of each word is counted

The resulting counts are stored in scipy.sparse matrices, as sketched below.
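
As a minimal sketch of this mapping (using two made-up sentences rather than the newsgroup data):


In [ ]:
# Illustrative only: a tiny corpus showing the word-to-id mapping and the per-document counts
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat on the mat", "the dog sat"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

print(toy_vect.vocabulary_)   # each word gets an integer identifier
print(toy_counts.toarray())   # one row per document, one column per word, values are counts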

Tokenizing text with scikit-learn

Using CountVectorizer we load the training data into a sparse matrix.

What is a sparse matrix?


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape


Out[9]:
(2257, 35788)

In [10]:
X_train_counts.__class__


Out[10]:
scipy.sparse.csr.csr_matrix
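
To see why a sparse format is used here, we can compare the number of stored non-zero entries (the nnz attribute of the scipy matrix) with the size of an equivalent dense matrix:


In [ ]:
# Compare stored (non-zero) entries with the size of a dense matrix of the same shape
n_rows, n_cols = X_train_counts.shape
print("non-zero entries stored: {0}".format(X_train_counts.nnz))
print("entries in a dense matrix of the same shape: {0}".format(n_rows * n_cols))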

Using the fitted CountVectorizer's vocabulary_ mapping we can get the integer identifier of a word.


In [11]:
count_vect.vocabulary_.get(u'application')


Out[11]:
5285

With this identifier we can get the count of the word in a given document.


In [12]:
print("Word count for application in first document: {0} and last document: {1} ").format(
    X_train_counts[0, 5285], X_train_counts[2256, 5285])


Word count for application in first document: 1 and last document: 0 

In [13]:
count_vect.vocabulary_.get(u'subject')


Out[13]:
31077

In [14]:
print("Word count for email in first document: {0} and last document: {1} ").format(
    X_train_counts[0, 31077], X_train_counts[2256, 31077])


Word count for subject in first document: 1 and last document: 1

In [15]:
count_vect.vocabulary_.get(u'to')


Out[15]:
32493

In [16]:
print("Word count for email in first document: {0} and last document: {1} ").format(
    X_train_counts[0, 32493], X_train_counts[2256, 32493])


Word count for to in first document: 4 and last document: 0

What are two problems with using a word count in a document?

From occurrences to frequencies

$\text{Term Frequencies tf} = \text{occurrences of each word} / \text{total number of words}$

tf-idf is "Term Frequencies times Inverse Document Frequency"
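
As a rough, hand-rolled illustration of the term frequency formula (using a made-up sentence; note that scikit-learn's TfidfTransformer applies its own normalisation, so its values will not match this simple ratio exactly):


In [ ]:
# Illustrative only: term frequency as occurrences divided by total words
toy_words = "to be or not to be".split()
total = len(toy_words)
for word in sorted(set(toy_words)):
    tf = toy_words.count(word) / float(total)
    print("tf({0}) = {1}/{2} = {3:.3f}".format(word, toy_words.count(word), total, tf))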

Calculating tfidf


In [17]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tfidf_2stage = tf_transformer.transform(X_train_counts)
X_train_tfidf_2stage.shape


Out[17]:
(2257, 35788)

.fit(..) fits the estimator to the data; .transform(..) transforms the count matrix to tf-idf.

It is possible to merge the fit and transform stages using .fit_transform(..)

Calculate tfidf


In [18]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


Out[18]:
(2257, 35788)

In [19]:
print("In first document tf-idf for application: {0} subject: {1} to: {2}").format(
    X_train_tfidf[0, 5285], X_train_tfidf[0, 31077], X_train_tfidf[0, 32493])


In first document tf-idf for application: 0.0841345440909 subject: 0.0167978060212 to: 0.0728377394162

Training a classifier

Now that we have features, we can train a classifier to try to predict the category of a post. First we will try a naïve Bayes classifier.


In [20]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Here count_vect and tfidf_transformer are used to transform the new documents into the same feature space, and the trained classifier predicts their categories.


In [21]:
docs_new = ['God is love', 'Heart attacks are common', 'Disbelief in a proposition', 'Disbelief in a proposition means that one does not believe it to be true', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
     print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'Heart attacks are common' => sci.med
'Disbelief in a proposition' => alt.atheism
'Disbelief in a proposition means that one does not believe it to be true' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

We can see it gets some right but not all.

Building a pipeline

Here we can put all the stages together in a pipeline. The names 'vect', 'tfidf' and 'clf' are arbitrary.


In [22]:
from sklearn.pipeline import Pipeline
text_clf_bayes = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

In [23]:
text_clf_bayes_fit = text_clf_bayes.fit(twenty_train.data, twenty_train.target)

Evaluation


In [24]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted_bayes = text_clf_bayes_fit.predict(docs_test)
np.mean(predicted_bayes == twenty_test.target)


Out[24]:
0.83488681757656458

Try a support vector machine instead


In [25]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, n_iter=5, random_state=42)),])
text_clf_svm_fit = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm_fit.predict(docs_test)
np.mean(predicted_svm == twenty_test.target)


Out[25]:
0.9127829560585885

We can see the support vector machine scored higher than naïve Bayes. What does that score mean? We move on to metrics.

Using metrics

Classification report & Confusion matix

Here we will use a simple example to show classification reports and confusion matrices.

  • y_true is the true labels
  • y_pred is the predicted labels

In [26]:
from sklearn import metrics

y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]
print(metrics.classification_report(y_true, y_pred,
    target_names=["ant", "bird", "cat"]))


             precision    recall  f1-score   support

        ant       0.67      1.00      0.80         2
       bird       1.00      0.50      0.67         2
        cat       0.67      0.67      0.67         3

avg / total       0.76      0.71      0.70         7

Here we can see that the predictions:

  • predicted ant 3 times but only 2 of those were actually ants, hence precision of 0.67.
  • found every actual ant (2 of 2), hence recall of 1.00.
  • f1-score is the harmonic mean of precision and recall.
  • support of 2 means there were 2 ants in the true labels.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
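
The same figures can be recomputed directly with precision_recall_fscore_support (the function documented at the link above); this should reproduce the numbers in the report:


In [ ]:
# Recompute precision, recall, f1 and support for the same toy labels
precision, recall, f1, support = metrics.precision_recall_fscore_support(
    y_true, y_pred, labels=["ant", "bird", "cat"])
for name, p, r, f, s in zip(["ant", "bird", "cat"], precision, recall, f1, support):
    print("{0}: precision={1:.2f} recall={2:.2f} f1={3:.2f} support={4}".format(name, p, r, f, s))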

Confusion matrix


In [ ]:
metrics.confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

In the confusion_matrix the labels give the order of the rows (and columns); a labelled printout is sketched below.

  • ant was correctly categorised twice and was never miscategorised
  • bird was correctly categorised once and was categorised as cat once
  • cat was correctly categorised twice and was categorised as an ant once
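
As a small sketch (using the same labels list), we can print each row next to its true label to make the ordering explicit:


In [ ]:
# Print each confusion matrix row with its true label for readability
labels = ["ant", "bird", "cat"]
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print("predicted order: {0}".format(labels))
for label, row in zip(labels, cm):
    print("true {0}: {1}".format(label, row))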

In [ ]:
metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

Back to '20 newsgroups dataset'


In [ ]:
print(metrics.classification_report(twenty_test.target, predicted_svm,
    target_names=twenty_test.target_names))

We can see where the 91% score came from.


In [ ]:
# We got the evaluation score this way before:
print(np.mean(predicted_svm == twenty_test.target))
# We get the same results using metrics.accuracy_score
print(metrics.accuracy_score(twenty_test.target, predicted_svm, normalize=True, sample_weight=None))

Now let's see the confusion matrix.


In [ ]:
print(twenty_train.target_names)

metrics.confusion_matrix(twenty_test.target, predicted_bayes)

We can see the naïve Bayes classifier gets a lot correct in some categories, but it also places a higher proportion of documents in the last category.


In [ ]:
metrics.confusion_matrix(twenty_test.target, predicted_svm)

We can see that, using the support vector machine, atheism is miscategorised as Christian, and sci.med as comp.graphics, a high proportion of the time.

Parameter tuning

Transformations and classifiers have various parameters. Rather than manually tweaking each parameter in the pipeline, it is possible to use a grid search instead.

Here we try a couple of options for each stage. The more options, the longer the grid search will take.


In [ ]:
from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-3, 1e-4),
}

In [ ]:
gs_clf = GridSearchCV(text_clf_svm_fit, parameters, n_jobs=-1)

Running the search on all the data will take a little while, roughly 10-30 seconds on a newish desktop with 8 cores. If you don't want to wait that long, uncomment the line with :400 and comment out the other.


In [ ]:
#gs_clf_fit = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
gs_clf_fit = gs_clf.fit(twenty_train.data, twenty_train.target)

In [ ]:
best_parameters, score, _ = max(gs_clf_fit.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
score

That is a significant improvement. Let's use these new parameters.


In [ ]:
text_clf_svm_tuned = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=0.0001, n_iter=5, random_state=42)),
])
text_clf_svm_tuned_fit = text_clf_svm_tuned.fit(twenty_train.data, twenty_train.target)
predicted_tuned = text_clf_svm_tuned_fit.predict(docs_test)
metrics.accuracy_score(twenty_test.target, predicted_tuned, normalize=True, sample_weight=None)

In [ ]:
for x in gs_clf_fit.grid_scores_:
    print("{0} {1} {2}".format(x[0], x[1], x[2]))

Moving on, let's see where the improvements were made.


In [ ]:
print(metrics.classification_report(twenty_test.target, predicted_svm,
    target_names=twenty_test.target_names))

metrics.confusion_matrix(twenty_test.target, predicted_svm)

In [ ]:
print(metrics.classification_report(twenty_test.target, predicted_tuned,
    target_names=twenty_test.target_names))

metrics.confusion_matrix(twenty_test.target, predicted_tuned)

We see comp.graphics is the only category to see a drop in prediction performance; the others have improved.

Conclusion

We can see that scikit-learn can do a good job of classification with the amount of training and test data in this simple example.

  1. Can you see a use in your project?
  2. What issues can you see with the training and test data?

In [ ]: